Abstract

Classifiers with rejection are essential in real-world applications where misclassifications and their effects are 
critical. However, if no problem specific cost function is defined, there are no established measures to assess the 
performance of such classifiers. We introduce a set of desired properties for performance measures for classifiers 
with rejection, based on which we propose a set of three performance measures for the evaluation of the performance 
of classifiers with rejection that satisfy the desired properties. The nonrejected accuracy measures the ability of the 
classifier to accurately classify nonrejected samples; the classification quality measures the correct decision making 
of the classifier with rejector; and the rejection quality measures the ability to concentrate all misclassified samples 
onto the set of rejected samples. From the measures, we derive the concept of relative optimality that allows us 
to connect the measures to a family of cost functions that take into account the trade-off between rejection and 
misclassification. We illustrate the use of the proposed performance measures on classifiers with rejection applied 
to synthetic and real-world data. 


I. Introduction 

Classification with rejection is a viable option in real-world applications of machine learning and pattern
recognition, where the presence and cost of errors can be detrimental to performance. This includes
situations where the cost of misclassifying is high (as in
automated medical diagnosis [1], [2] or in landcover classification in remote sensing [3], [4]), or where
samples might be of no interest to the application (as in image retrieval [5]). A classifier with rejection
can also cope with unknown information, reducing the threat posed by the existence of unknown samples 
or mislabeled training samples that hamper the classifier’s performance. 

Classification with rejection was first analyzed in [6], where a rule for the optimum error-reject trade-off,
known as Chow’s rule, was presented. In a binary classification setting, Chow’s rule allows for the determination of
a threshold for rejection such that the classification risk is minimized. This requires both the knowledge of
the a posteriori probabilities and the existence of a cost function that specifies the cost of misclassification
and the cost of rejection.

Multiple other designs for incorporating rejection into classification exist. In a binary classification 
setting, the reject option can be embedded in the classifier. An embedded reject option is possible through 
a risk minimization approach with the use of a hinge function, such as in [7], [8], [9], to minimize 
classification risk. It can also be achieved with support vector machines with embedded reject options, as 
described in [10], [11], [12]. These embedded designs can also be extended to a rejection framework in
a nonbinary classification setting [13].

There is no standard measure for the assessment of the performance of a classifier with rejection.
Accuracy-rejection curves, used in [14], [10], [15], [16], and their variants based on the analysis of the
F1 score, used in [17], [13], albeit popular in practical applications of classification with rejection, have
significant drawbacks. Obtaining sufficient points for an accuracy-rejection curve might not be feasible
for classifiers with an embedded reject option, which require retraining the classifier to achieve a different
rejection ratio, or for classifiers that combine contextual information with rejection, such as [2], [4]. This
means that accuracy-rejection curves and F1-rejection curves are, in practice, not able to describe
the behavior of the classifier with rejection in all cases.

In [18], a different approach is taken. A 3D ROC (receiver operating characteristic) plot of a 2D ROC
surface is obtained by decomposing the false positive rate into false positive rate for outliers belonging to
known classes and false positive rate for outliers belonging to unknown classes, with the VUC (volume 
under the curve) as the performance measure. The use of ROC curves for the analysis of the performance 
suffers from the same problems associated with accuracy-rejection curves. 

To fill this gap, we propose a set of desired properties for performance measures for classifiers with 
rejection, and a set of three performance measures that satisfy those properties. 

A performance measure that evaluates the performance of a rejection mechanism given a classifier 
should satisfy the following: 

• Property I — be a function of the fraction of rejected samples; 

• Property II — be able to compare different rejection mechanisms working at the same fraction of 
rejected samples; 

• Property III — be able to compare rejection mechanisms working at different fractions of rejected
samples when one rejection mechanism outperforms the other; 

• Property IV — be maximum for a rejection mechanism that no other feasible rejection mechanism 
outperforms, and minimum for a rejection mechanism that all other feasible rejection mechanisms 
outperform. 

These properties rely on being able to state whether one rejection mechanism qualitatively outperforms 
the other. If a cost function exists that takes into account the cost of rejection and misclassification, the
concept of outperformance is trivial, and this cost function not only satisfies the properties but is also
the ideal performance measure for the problem at hand. It might not be feasible, however, to design a
cost function for each individual classification problem. Thus, we derive a set of cases where the concept 
of outperformance is independent from a specific cost function (under the assumption that the cost of 
rejection is never greater than the cost of misclassification). 

With the properties and the concept of outperformance in place, we present three measures that satisfy 
the above properties: 

• Nonrejected accuracy measures the ability of the classifier to accurately classify nonrejected samples; 

• Classification quality measures the ability of the classifier with rejection to accurately classify 
nonrejected samples and to reject misclassified samples; 

• Rejection quality measures the ability of the classifier with rejection to make errors on rejected 
samples only. 

With the three measures in place, we can explore the best and worst case scenarios for each measure, 
for a given reference classifier with rejection. We denote the proximity of a classifier with rejection to its 
best and worst case scenarios, with regard to a reference classifier with rejection, as relative optimality. 
This allows us to easily connect performance measures to problem-specific cost functions. For a classifier
with rejection that rejects at two different numbers of rejected samples, the relative optimality defines the
family of cost functions on which rejection at one number of rejected samples is better, equal, or worse than
rejection at the other number of rejected samples.

The rest of the paper is structured as follows. In Section II, we present the classifier with rejection; we 
introduce the necessary notation in Section II-A, define the three concepts of rejector outperformance that
do not depend on cost functions in Section II-B, and present the desired performance measure properties
in Section II-C. In Section III, we present the set of proposed performance measures. In Section IV,
we connect the performance measures to cost functions by defining relative optimality. In Section V, we
illustrate the performance measures on real-world applications. In Section VI, we conclude the paper.


II. Classifiers with rejection

A. Notation 

A classifier with rejection can be seen as a coupling of a classifier $C$ with a rejection system $R$. The
classification maps $n$ $d$-dimensional feature vectors $\mathbf{x}$ into $n$ labels, $C : \mathbb{R}^{d \times n} \to \{1, \ldots, K\}^n$, such
that

$$\hat{\mathbf{y}} = C(\mathbf{x}),$$

where $\hat{\mathbf{y}}$ denotes a labeling. The rejector $R$ maps the classification (feature vectors and associated labels)
into a binary rejection vector, $R : \mathbb{R}^{d \times n} \times \{1, \ldots, K\}^n \to \{0, 1\}^n$, such that

$$\mathbf{r} = R(\mathbf{x}, \hat{\mathbf{y}}),$$

where $\mathbf{r}$ denotes the binary rejection vector. We define a classification with rejection $\hat{\mathbf{y}}^R$ as

$$\hat{y}^R_i = \begin{cases} \hat{y}_i, & \text{if } r_i = 0, \\ 0, & \text{if } r_i = 1, \end{cases}$$

where $r_i$ corresponds to the binary decision to reject ($r_i = 1$) or not ($r_i = 0$) the $i$th classification, and
$\hat{y}^R_i = 0$ denotes rejection.

By comparing the classification $\hat{\mathbf{y}}$ with its ground truth $\mathbf{y}$, we form a binary $n$-dimensional accuracy
vector $\mathbf{a}$, such that $a_i$ measures whether the $i$th sample is classified accurately. The binary vector $\mathbf{a}$
imposes a partition of the set of samples into two subsets $\mathcal{A}$ and $\mathcal{M}$, namely the subset of accurately
classified samples and the subset of misclassified samples. Let $\mathbf{c}$ be a confidence vector associated with
the classification $\hat{\mathbf{y}}$, such that

$$c_i > c_j \implies r_i \leq r_j,$$

that is, if sample $i$ is rejected, then all the samples $j$ with smaller confidence $c_j < c_i$ are also rejected.
We thus have the ground truth $\mathbf{y}$, the result of the classification $\hat{\mathbf{y}}$, and the result of the classification with
rejection $\hat{\mathbf{y}}^R$.

Let $\tilde{\mathbf{c}}$ denote the reordering of the confidence vector $\mathbf{c}$ in decreasing order. If we keep the $k$ samples with
the highest confidence and reject the remaining $n - k$ samples, we obtain two subsets: $k$ nonrejected samples
and $n - k$ rejected samples, $\mathcal{N}$ and $\mathcal{R}$ respectively. Our goal is to separate the accuracy vector $\mathbf{a}$ into
two subvectors ($\mathbf{a}_{\mathcal{N}}$ and $\mathbf{a}_{\mathcal{R}}$), based on the confidence vector $\mathbf{c}$, such that all misclassifications are in the
$\mathbf{a}_{\mathcal{R}}$ subvector, and all accurate classifications are in the $\mathbf{a}_{\mathcal{N}}$ subvector. We should note that, since $\mathcal{N}$ and
$\mathcal{R}$ have disjoint supports,

$$\|\mathbf{a}\| = \|\mathbf{a}_{\mathcal{N}}\| + \|\mathbf{a}_{\mathcal{R}}\|, \qquad (1)$$

for all $\mathcal{N}$, $\mathcal{R}$ such that $\mathcal{N} \cap \mathcal{R} = \emptyset$ and $\mathcal{N} \cup \mathcal{R} = \{1, \ldots, n\}$. As we only work with norms of binary
vectors, we point out that $\|\mathbf{a}\|_0 = \|\mathbf{a}\|_1$; for simplicity, we omit the subscript.
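
As an illustration of the keep-$k$ rule described above, the following Python sketch builds a binary rejection vector from a confidence vector and checks the norm identity (1) on toy data. The function name `reject_lowest_confidence`, the toy values, and the tie-breaking by sample index are assumptions made for this example only; they are not part of the formulation.

```python
def reject_lowest_confidence(c, k):
    """Keep the k most confident samples (r_i = 0) and reject the rest (r_i = 1)."""
    n = len(c)
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    # Indices sorted by decreasing confidence; ties keep index order (assumption).
    order = sorted(range(n), key=lambda i: c[i], reverse=True)
    r = [1] * n
    for i in order[:k]:
        r[i] = 0
    return r


# Toy data (hypothetical): accuracy vector a and confidence vector c.
a = [1, 0, 1, 1, 0]
c = [0.9, 0.4, 0.7, 0.55, 0.3]
r = reject_lowest_confidence(c, k=3)

# Norms of the subvectors of a on the nonrejected and rejected supports.
norm_a_N = sum(a_i for a_i, r_i in zip(a, r) if r_i == 0)
norm_a_R = sum(a_i for a_i, r_i in zip(a, r) if r_i == 1)
print(r, norm_a_N, norm_a_R)
```

Because the two supports are disjoint and cover all samples, `norm_a_N + norm_a_R` equals `sum(a)` for any choice of `k`.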



Fig. 1. Partition of the sample space based on the performance of the (a) classification only (partition space $\mathcal{A}$ and $\mathcal{M}$); (b) rejection
only (partition space $\mathcal{R}$ and $\mathcal{N}$); and (c) classification with rejection. Green corresponds to accurately classified samples and orange to
misclassified samples. Gray corresponds to rejected samples and white to nonrejected samples.


¹ We note that $R$ corresponds to a rejector, a function that maps a classification into a binary rejection vector, whereas $\mathcal{R}$ denotes a set of
samples that are rejected.

With the partitioning of the sample space into $\mathcal{A}$ and $\mathcal{M}$ according to the values of the binary vector $\mathbf{a}$,
and the partitioning of the sample space into $\mathcal{N}$ and $\mathcal{R}$, we thus partition the sample space as in Fig. 1:

• $\mathcal{A} \cap \mathcal{N}$: samples accurately classified and not rejected; the number of such samples is $|\mathcal{A} \cap \mathcal{N}| = \|\mathbf{a}_{\mathcal{N}}\|$

• $\mathcal{M} \cap \mathcal{N}$: samples misclassified and not rejected; the number of such samples is $|\mathcal{M} \cap \mathcal{N}| = \|\mathbf{1} - \mathbf{a}_{\mathcal{N}}\|$

• $\mathcal{A} \cap \mathcal{R}$: samples accurately classified and rejected; the number of such samples is $|\mathcal{A} \cap \mathcal{R}| = \|\mathbf{a}_{\mathcal{R}}\|$

• $\mathcal{M} \cap \mathcal{R}$: samples misclassified and rejected; the number of such samples is $|\mathcal{M} \cap \mathcal{R}| = \|\mathbf{1} - \mathbf{a}_{\mathcal{R}}\|$
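
The four-way partition listed above can be computed directly from the accuracy vector and the rejection vector. The Python sketch below does so on toy data; the function name `partition_counts`, the dictionary keys, and the sample values are illustrative assumptions.

```python
def partition_counts(a, r):
    """Count the samples in each cell of the partition.

    a[i] = 1 if sample i is accurately classified, 0 if misclassified.
    r[i] = 1 if sample i is rejected, 0 if it is kept.
    """
    if len(a) != len(r):
        raise ValueError("a and r must have the same length")
    counts = {"A_and_N": 0, "M_and_N": 0, "A_and_R": 0, "M_and_R": 0}
    for a_i, r_i in zip(a, r):
        if r_i == 0:
            counts["A_and_N" if a_i == 1 else "M_and_N"] += 1
        else:
            counts["A_and_R" if a_i == 1 else "M_and_R"] += 1
    return counts


# Toy data (hypothetical): 10 samples, the last 3 rejected.
a = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1]
r = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
counts = partition_counts(a, r)
print(counts)
```

The four counts always sum to the number of samples, which is what later allows the measures to be related to a confusion matrix.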

B. Comparing classifiers with rejection

The comparison of the performance of two rejectors is nontrivial. It depends on the existence of a
problem-specific cost function that takes into account the trade-off between misclassification and rejection.
If a cost function exists, the performance is linked to the comparison of the cost function evaluated on
each rejector. However, as previously stated, the design of a problem-specific cost function might not
be feasible. Let $\rho$ denote the trade-off between rejection and misclassification, thus defining a family of
cost functions where a misclassification has a unitary cost, a rejection has a cost of $\rho$, and an accurate
classification has no cost. Then, there are three general cases where it is possible to perform comparisons
between the performance of two rejectors independently of $\rho$: when the number of rejected samples is the
same; when the number of accurately classified samples not rejected is the same; and when the number
of misclassified samples not rejected is the same. This is true for all $\rho$, if we assume that $0 \leq \rho \leq 1$,
which is a reasonable assumption as $\rho < 0$ would lead to a rejection only problem (all samples rejected),
and $\rho > 1$ would lead to a classification only problem (no samples are rejected). Let $C$ denote a classifier
with an accuracy vector $\mathbf{a}$, and $R_1$ and $R_2$ denote two different rejection mechanisms that partition the
sample space into $\mathcal{N}_{R_1}$, $\mathcal{R}_{R_1}$ and $\mathcal{N}_{R_2}$, $\mathcal{R}_{R_2}$, respectively.

Equal number of rejected samples: If both rejectors reject the same number of samples, and if rejector
$R_1$ has a larger number of accurately classified nonrejected samples than $R_2$, then $R_1$ outperforms $R_2$.

Equal number of nonrejected accurately classified samples: If both rejectors have the same number
of accurately classified samples not rejected, and if rejector $R_1$ rejects more samples than $R_2$, then $R_1$
outperforms $R_2$.









Equal number of nonrejected misclassified samples: If both rejectors have the same number of misclassified
samples not rejected, and if rejector $R_1$ rejects fewer samples than $R_2$, then $R_1$ outperforms
$R_2$.
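
The three cost-independent cases can be collected into a single comparison routine. The Python sketch below is one possible encoding, assuming each rejector is summarized by a hypothetical tuple `(accurate_kept, misclassified_kept, rejected)` computed for the same classifier; the function name and the return convention are choices made for this example.

```python
def compare_rejectors(first, second):
    """Cost-independent comparison of two rejectors on the same classifier.

    Each argument is a tuple (accurate_kept, misclassified_kept, rejected).
    Returns 1 if the first rejector outperforms the second, -1 for the
    converse, 0 if the counts coincide, and None if none of the three
    cases applies (the answer then depends on the rejection cost).
    """
    an1, mn1, rej1 = first
    an2, mn2, rej2 = second
    if an1 + mn1 + rej1 != an2 + mn2 + rej2:
        raise ValueError("both rejectors must act on the same n samples")
    if (an1, mn1, rej1) == (an2, mn2, rej2):
        return 0
    if rej1 == rej2:  # equal number of rejected samples
        return 1 if an1 > an2 else -1
    if an1 == an2:  # equal number of accurate nonrejected samples
        return 1 if rej1 > rej2 else -1
    if mn1 == mn2:  # equal number of misclassified nonrejected samples
        return 1 if rej1 < rej2 else -1
    return None


# Hypothetical counts for n = 10 samples.
print(compare_rejectors((5, 2, 3), (4, 3, 3)))
print(compare_rejectors((6, 3, 1), (5, 2, 3)))
```

When the function returns `None`, the two rejectors differ in all three counts and a value of the rejection cost is needed to rank them.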




C. Desired properties of performance measures 

The definition of the rejection problem as the partition of the accuracy vector $\mathbf{a}$ based on two disjoint
supports $\mathcal{N}$ and $\mathcal{R}$ is general and allows us to define desired characteristics for any generic performance
measure $\alpha$ that evaluates the performance of classification with rejection.

We start by introducing the rejected fraction $r$ as the ratio of rejected samples to the overall number
of samples,

$$r = \frac{|\mathcal{R}|}{|\mathcal{R}| + |\mathcal{N}|} = \frac{n - k}{n}. \qquad (2)$$


1) Property I: Performance measure is a function of the rejected fraction: The first desired characteristic
of a performance measure $\alpha$ is for the measure $\alpha$ to be a function of the fraction of rejected samples,

$$\alpha = \alpha(r). \qquad (3)$$


2) Property II: Performance measure is able to compare different rejector mechanisms working at the
same rejected fraction: For the same classification $C$, and for two different rejection mechanisms $R_1$ and
$R_2$, the performance measures $\alpha(C, R_1, r)$ for $R_1$ and $\alpha(C, R_2, r)$ for $R_2$ should be able to compare the
rejection mechanisms $R_1$ and $R_2$ when rejecting the same fraction:

$$\alpha(C, R_1, r) > \alpha(C, R_2, r) \iff R_1 \text{ outperforms } R_2. \qquad (4)$$


3) Property III: Performance measure is able to compare different rejector mechanisms working at
different rejected fractions: On the other hand, it is also desired that the performance measure be able to
compare the performance of different rejection mechanisms $R_1$ and $R_2$ when they reject different fractions
$r_1$ and $r_2$,

$$R_1 \text{ outperforms } R_2 \implies \alpha(C, R_1, r_1) > \alpha(C, R_2, r_2). \qquad (5)$$


4) Property IV: Maximum and minimum values for performance measures: Any performance measure
should achieve its maximum when $\mathcal{N}$ coincides with $\mathcal{A}$ and $\mathcal{R}$ with $\mathcal{M}$, corresponding to simultaneously
rejecting all misclassified samples and not rejecting any accurately classified samples (the sets where $\mathbf{a}_{\mathcal{N}} = 0$ and
$\mathbf{a}_{\mathcal{R}} = 1$ are empty). Similarly, the performance measure should achieve its minimum when $\mathcal{N}$ coincides
with $\mathcal{M}$ and $\mathcal{R}$ with $\mathcal{A}$, corresponding to rejecting all accurately classified samples and not rejecting any
misclassified samples (the sets where $\mathbf{a}_{\mathcal{N}} = 1$ and $\mathbf{a}_{\mathcal{R}} = 0$ are empty).


III. Performance measures 

We are now ready to define the three performance measures. First, we will show that the nonrejected 
accuracy, as used extensively in the literature, is a performance measure that satisfies all our properties. 
We will then present two other measures that also satisfy the same properties: classification quality and 
rejection quality. 


A. Nonrejected accuracy A 

The nonrejected accuracy measures the accuracy on the subset of nonrejected samples,

$$A = \frac{\|\mathbf{a}_{\mathcal{N}}\|}{|\mathcal{N}|} = \frac{\|\mathbf{a}_{\mathcal{N}}\|}{k}.$$



The nonrejected accuracy measures the proportion of samples that are accurately classified and not rejected
compared to the samples that are not rejected. In a probabilistic interpretation, it is equivalent to the
expected value of the conditional probability of a sample being accurately classified given that it was not
rejected.

We can represent the nonrejected accuracy as a function of the rejected fraction,

$$A = \frac{\|\mathbf{a}_{\mathcal{N}}\|}{n(1 - r)} = A(r), \qquad (6)$$

satisfying Property I. Properties II, III, and IV are also satisfied; the proof is given in the Appendix.

We note that the maximum and minimum values of the nonrejected accuracy, 1 and 0 respectively, are
nonunique. Two different rejectors can have a nonrejected accuracy of 1 if the nonrejected samples are
all accurately classified. For example, if rejector $R_1$ rejects all misclassified samples and does not reject
any accurately classified samples, $\mathcal{R} = \mathcal{M}$, and rejector $R_2$ rejects all misclassified samples and some
accurately classified samples, $\mathcal{R} \supset \mathcal{M}$, both their nonrejected accuracies will be 1.
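
The nonuniqueness of the maximum can be checked numerically. The Python sketch below computes the nonrejected accuracy from an accuracy vector and a rejection vector and evaluates it for two different rejectors on hypothetical toy data. Raising an error when every sample is rejected is a choice made for this example, since that case is not defined in the text.

```python
def nonrejected_accuracy(a, r):
    """Fraction of nonrejected samples that are accurately classified."""
    kept = [a_i for a_i, r_i in zip(a, r) if r_i == 0]
    if not kept:
        # Assumption: the measure is left undefined when all samples are rejected.
        raise ValueError("nonrejected accuracy is undefined when all samples are rejected")
    return sum(kept) / len(kept)


# Toy accuracy vector (hypothetical): three accurate, two misclassified samples.
a = [1, 1, 1, 0, 0]
r_first = [0, 0, 0, 1, 1]   # rejects exactly the misclassified samples
r_second = [0, 0, 1, 1, 1]  # also rejects one accurately classified sample

print(nonrejected_accuracy(a, r_first), nonrejected_accuracy(a, r_second))
```

Both rejectors reach the maximum value even though the second one discards an accurately classified sample, which is why this measure alone cannot separate them.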


B. Classification quality Q 

The classification quality measures the correct decision making of the classifier-rejector, assessing both
the performance of the classifier on the set of nonrejected samples and the performance of the rejector on
the set of misclassified samples. This equates to measuring the number of accurately classified samples
not rejected, $|\mathcal{A} \cap \mathcal{N}|$, and the number of misclassified samples rejected, $|\mathcal{M} \cap \mathcal{R}|$,

$$Q = \frac{\|\mathbf{a}_{\mathcal{N}}\| + \|\mathbf{1} - \mathbf{a}_{\mathcal{R}}\|}{\|\mathbf{a}\| + \|\mathbf{1} - \mathbf{a}\|} = \frac{\|\mathbf{a}_{\mathcal{N}}\| + \|\mathbf{1} - \mathbf{a}_{\mathcal{R}}\|}{n}.$$

In a probabilistic interpretation, this is equivalent to the expected value of the probability of a sample being
accurately classified and not rejected or a sample being misclassified and rejected.

To represent the classification quality $Q$ as a function of the fraction of rejected samples $r$, we analyze
separately the performance of the classifier on the subset of nonrejected samples and the performance of
the rejector on the subset of misclassified samples. The performance of the classifier on the subset of
nonrejected samples is the proportion of accurately classified samples not rejected to the total number of
samples, which can be easily represented in terms of the nonrejected accuracy as follows,

$$\frac{\|\mathbf{a}_{\mathcal{N}}\|}{n} = \frac{\|\mathbf{a}_{\mathcal{N}}\|}{n(1 - r)}(1 - r) = A(r)(1 - r). \qquad (7)$$

The performance of the rejector on the subset of misclassified samples is

$$\begin{aligned} \frac{\|\mathbf{1} - \mathbf{a}_{\mathcal{R}}\|}{n} &= \frac{\|\mathbf{1} - \mathbf{a}\| - \|\mathbf{1} - \mathbf{a}_{\mathcal{N}}\|}{n} = 1 - A(0) - \frac{\|\mathbf{1} - \mathbf{a}_{\mathcal{N}}\|}{n} \\ &= 1 - A(0) - \frac{n(1 - r)}{n} + \frac{\|\mathbf{a}_{\mathcal{N}}\|}{n} \\ &= 1 - A(0) - (1 - r) + A(r)(1 - r) = -A(0) + r + A(r)(1 - r). \end{aligned} \qquad (8)$$

By combining (7) and (8), we can represent the classification quality as

$$Q(r) = 2A(r)(1 - r) + r - A(0), \qquad (9)$$


satisfying Property I. Properties II, III, and IV are also satisfied; the proof is given in the Appendix. 

We note that both the maximum and the minimum values of the classification quality, 1 and 0 respectively,
are unique. $Q(r) = 1$ describes an ideal rejector that does not reject any of the accurately classified
samples and rejects all misclassified samples, $\mathcal{A} = \mathcal{N}$ and $\mathcal{M} = \mathcal{R}$. Conversely, $Q(r) = 0$ describes
the worst rejector that rejects all the accurately classified samples and does not reject any misclassified
sample, $\mathcal{A} = \mathcal{R}$ and $\mathcal{M} = \mathcal{N}$.

We can use the classification quality as in (9) to compare the proportion of correct decisions between two
different rejectors, for different values of rejected fractions. We note that, as $Q(0) = A(0)$, we can compare
the proportion of correct decisions by using classification with rejection versus the use of no rejection at
all.
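
As a numerical check of (9), the Python sketch below computes the classification quality in two ways: by counting correct decisions directly, and from the nonrejected accuracy, the rejected fraction, and the accuracy without rejection. The function names and the toy vectors are assumptions made for this example.

```python
def classification_quality(a, r):
    """Fraction of correct decisions: accurate and kept, or misclassified and rejected."""
    correct = sum(
        1 for a_i, r_i in zip(a, r)
        if (a_i == 1 and r_i == 0) or (a_i == 0 and r_i == 1)
    )
    return correct / len(a)


def classification_quality_from_accuracy(A_r, r_frac, A_0):
    """Closed form Q(r) = 2 A(r) (1 - r) + r - A(0)."""
    return 2 * A_r * (1 - r_frac) + r_frac - A_0


# Toy data (hypothetical): 10 samples, the last 3 rejected.
a = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1]
r = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

q_direct = classification_quality(a, r)
A_r = 5 / 7          # 5 accurate samples among the 7 kept
r_frac = 3 / 10      # 3 of 10 samples rejected
A_0 = sum(a) / len(a)
q_closed = classification_quality_from_accuracy(A_r, r_frac, A_0)
print(q_direct, q_closed)
```

With no rejection (`r` all zeros) the direct count reduces to the plain accuracy, which is the identity $Q(0) = A(0)$ used above.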

C. Rejection quality $\phi$

Finally, we present the rejection quality to evaluate the ability of the rejector to reject misclassified
samples. This is measured through the ability to concentrate all misclassified samples onto the rejected
portion of samples. The rejection quality is computed by comparing the proportion of misclassified
to accurately classified samples on the set of rejected samples with the proportion of misclassified to
accurately classified samples on the entire data set,

$$\phi = \frac{\|\mathbf{1} - \mathbf{a}_{\mathcal{R}}\| \,/\, \|\mathbf{a}_{\mathcal{R}}\|}{\|\mathbf{1} - \mathbf{a}\| \,/\, \|\mathbf{a}\|}.$$

As the rejection quality is not defined when there are no accurately classified rejected samples, $\|\mathbf{a}_{\mathcal{R}}\| = 0$, we
define $\phi = \infty$ if any sample is rejected, $|\mathcal{R}| > 0$, meaning that no accurately classified sample is rejected
and some misclassified samples are rejected, and $\phi = 1$ if no sample is rejected, $|\mathcal{R}| = 0$. To express the
rejection quality as a function of the rejected fraction, we note that, by (1), we can represent the accuracy
on the rejected fraction as $\|\mathbf{a}_{\mathcal{R}}\| = \|\mathbf{a}\| - \|\mathbf{a}_{\mathcal{N}}\|$, and $\|\mathbf{1} - \mathbf{a}\|$ as $n(1 - A(0))$. This means that

$$\phi = \frac{r - A(0) + A(r)(1 - r)}{A(0) - A(r)(1 - r)} \cdot \frac{A(0)}{1 - A(0)},$$

satisfying Property I. Properties II, III, and IV are also satisfied; the proof is given in the Appendix. 

Unlike the nonrejected accuracy and the classification quality, the rejection quality is unbounded. A value
of $\phi$ greater than one means that the rejector is effectively decreasing the concentration of misclassified
samples on the subset of nonrejected samples, thus increasing the nonrejected accuracy.

The minimum value of $\phi$ is 0, and its maximum is unbounded by construction. Any rejector that rejects only
misclassified samples will achieve a $\phi$ value of $\infty$, even if it does not reject some misclassified
samples.
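
The rejection quality and its two special cases can be sketched in Python as follows. The function names and toy data are hypothetical. The error raised when the classifier makes no mistakes or no accurate classifications at all is an assumption of this example, since the text does not treat those degenerate cases.

```python
import math


def rejection_quality(a, r):
    """Odds of misclassification among rejected samples relative to the whole set."""
    acc_rej = sum(1 for a_i, r_i in zip(a, r) if a_i == 1 and r_i == 1)
    mis_rej = sum(1 for a_i, r_i in zip(a, r) if a_i == 0 and r_i == 1)
    if acc_rej + mis_rej == 0:
        return 1.0       # convention from the text: no sample rejected
    if acc_rej == 0:
        return math.inf  # convention from the text: only misclassified samples rejected
    acc_all = sum(a)
    mis_all = len(a) - acc_all
    if mis_all == 0:
        # Assumption: left undefined when the classifier makes no errors.
        raise ValueError("rejection quality is undefined without misclassified samples")
    return (mis_rej * acc_all) / (acc_rej * mis_all)


def rejection_quality_from_accuracy(A_r, r_frac, A_0):
    """Closed form in terms of A(r), r and A(0); assumes some accurate samples are rejected."""
    kept_accurate = A_r * (1 - r_frac)
    return (r_frac - A_0 + kept_accurate) / (A_0 - kept_accurate) * A_0 / (1 - A_0)


a = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1]
r = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
print(rejection_quality(a, r))
print(rejection_quality_from_accuracy(0.625, 0.2, 0.55))
```

The second call uses a rejected fraction of 20%, a nonrejected accuracy of 62.5%, and $A(0) = 0.55$, the value implied by (9) for a classification quality of 65%; it evaluates to roughly 3.67.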




IV. Quantifying performance of a classifier with rejection

With the three performance measures defined, we can now compare performance of classifiers with 
rejection. We illustrate this in Fig. 2, where we consider a general classifier with rejection. In the figure, 
black circles in the center correspond to a classifier that rejects 20% of the samples, with a nonrejected 
accuracy of 62.5%, a classification quality of 65%, and a rejection quality of 3.67; we call that black 
circle a reference operating point. 

A. Reference operating point, operating point and operating set 

A set of performance measures and the associated rejected fraction r correspond to a reference operating 
point of the classifier with rejection. Given a reference operating point, we define the operating set as the 
set of achievable operating points as a function of the rejected fraction. This further means that for each 
operating point of a classifier with rejection there is an associated operating set. 

Any point in the green region of each of the plots in Fig. 2 is an operating point of a classifier with 
rejection that outperforms the one at the reference operating point (black circle), and any operating point 
in the orange region is an operating point of a classifier with rejection that is outperformed by the one 
at the reference operating point (black circle), regardless of the cost function (assuming that the cost of 
rejection is never greater than the cost of misclassification). In white regions, performance depends on the 
trade-off between rejection and misclassification, and is thus dependent on the cost function. The borders
of the green and orange regions correspond to the best and worst behaviors, respectively, of classifiers
with rejection as compared to the reference operating point. Thus, given the reference operating point, its
corresponding operating set is the union of the white regions including the borders.







Fig. 2. Performance measures with outperformance (green) and underperformance (orange) regions for a reference operating point (black circle).
Panels show (a) nonrejected accuracy, (b) classification quality, and the rejection quality. The reference classifier rejects 20% of the samples and achieves a nonrejected accuracy of 62.5%, a classification quality of 65%, and a rejection
quality of 3.67. $\beta$ measures correctness of rejection; $\beta = 1$ corresponds to the best and $\beta = -1$ to the worst rejection behaviors, respectively.


B. Relative optimality 

To compare the behavior of a classifier with rejection in the white region to that at the reference
operating point, we measure how close that classifier is to the green and orange region borders (best/worst
behaviors). Let $\beta = 0$ denote the curve that corresponds to the middle point between the best and worst
behaviors (black curve in Fig. 2), $\beta = 1$ to the best behavior (border with the green region), and $\beta = -1$
to the worst behavior (border with the orange region). We call $\beta$ relative optimality, as it compares the
behavior of a classifier with rejection relative to a given reference operating point.

Let us consider a reference operating point defined by a nonrejected accuracy $A_0$ at a rejected fraction
$r_0$; we can now compare the performance at an arbitrary operating point $(A_1, r_1)$ with that at the reference
operating point $(A_0, r_0)$ as

$$\beta = \begin{cases} 2\,\dfrac{A_1(1 - r_1) - A_0(1 - r_0)}{r_1 - r_0} + 1, & \text{if } r_1 > r_0, \\[2ex] -2\,\dfrac{A_1(1 - r_1) - A_0(1 - r_0)}{r_1 - r_0} - 1, & \text{if } r_1 < r_0. \end{cases} \qquad (10)$$
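
The relative optimality can be evaluated with a few lines of Python. The sketch below assumes the normalization described in the text, with $\beta = 1$ on the best-behavior border and $\beta = -1$ on the worst-behavior border, and uses the reference operating point of Fig. 2 (62.5% nonrejected accuracy at 20% rejection). The function name and the test points are illustrative.

```python
def relative_optimality(A1, r1, A0, r0):
    """Relative optimality of (A1, r1) with respect to the reference (A0, r0)."""
    if r1 == r0:
        raise ValueError("relative optimality needs different rejected fractions")
    # Change in the fraction of accurate nonrejected samples per unit of extra rejection.
    ratio = (A1 * (1 - r1) - A0 * (1 - r0)) / (r1 - r0)
    return 2 * ratio + 1 if r1 > r0 else -2 * ratio - 1


A0, r0 = 0.625, 0.2  # reference operating point

# Rejecting 10% more samples, all of them misclassified: best behavior.
best_more = relative_optimality(0.5 / 0.7, 0.3, A0, r0)
# Rejecting 10% more samples, all of them accurately classified: worst behavior.
worst_more = relative_optimality(0.4 / 0.7, 0.3, A0, r0)
# Rejecting 10% fewer samples, all recovered samples accurately classified: best behavior.
best_fewer = relative_optimality(0.6 / 0.9, 0.1, A0, r0)
print(best_more, worst_more, best_fewer)
```

The three values land on the borders, up to floating-point rounding, because the extra rejected (or recovered) samples are taken entirely from one of the two classes of decisions.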


C. Cost function 

Furthermore, the relative optimality allows us to compare any two operating points of a classifier
with rejection taking into account a cost function $L$ which measures the relative cost of rejection versus
misclassification. Let us consider the following generic cost function

$$L_\rho(\hat{y}^R_i) = \begin{cases} 0, & \hat{y}^R_i \text{ accurately classified and not rejected}; \\ 1, & \hat{y}^R_i \text{ misclassified and not rejected}; \\ \rho, & \hat{y}^R_i \text{ rejected}, \end{cases} \qquad (11)$$

where $\rho$ is the cost of rejection and represents the trade-off between rejection and misclassification. We
now connect the concept of relative optimality with the generic cost function $L$ as follows.

Theorem 1. For an operating point $(A_1, r_1)$ with a relative optimality of $\beta$ relative to the reference
operating point $(A_0, r_0)$, and $r_1 > r_0$,

$$\operatorname{sgn}(\Delta L_\rho) = \operatorname{sgn}\big(L_\rho(A_0, r_0) - L_\rho(A_1, r_1)\big) = \operatorname{sgn}\left(\frac{\beta + 1}{2} - \rho\right), \qquad (12)$$

where $\Delta L_\rho$ is the difference between the cost function at the reference operating point, $L_\rho(A_0, r_0)$, and the
cost function at the operating point, $L_\rho(A_1, r_1)$.

Proof. Let $r_1 > r_0$. The cost function at a generic operating point $(A, r)$ is

$$L_\rho(A, r) = (1 - r)(1 - A)n + \rho r n,$$

as we have $(1 - r)(1 - A)n$ misclassified samples, $(1 - r)An$ accurately classified samples, and $rn$ rejected
samples, and thus

$$\begin{aligned} \Delta L_\rho &= n\big((1 - r_0)(1 - A_0) + \rho r_0 - (1 - r_1)(1 - A_1) - \rho r_1\big) \\ &= n\big(r_1 - r_0 - (1 - r_0)A_0 + (1 - r_1)A_1 + \rho(r_0 - r_1)\big). \end{aligned} \qquad (13)$$

On the other hand, from (10), we have that

$$A_1(1 - r_1) - A_0(1 - r_0) = \frac{\beta - 1}{2}(r_1 - r_0). \qquad (14)$$

By combining (13) and (14), we have that

$$\Delta L_\rho = n(r_1 - r_0)\left(\frac{\beta + 1}{2} - \rho\right). \qquad (15)$$

Because $r_1 - r_0$ and $n$ are positive, $\Delta L_\rho$ and $(\beta + 1)/2 - \rho$ have the same sign. □

The previous discussion allows us to compare a classifier with rejection $R_1$ to the reference operating
point $R_0$ as follows. Let the operating point $(A_1, r_1)$ be at relative optimality $\beta$ with respect to the reference
operating point $(A_0, r_0)$. Then,

$$\begin{cases} L_\rho(A_1, r_1) < L_\rho(A_0, r_0), & \text{for } \rho < (\beta + 1)/2; \\ L_\rho(A_1, r_1) > L_\rho(A_0, r_0), & \text{for } \rho > (\beta + 1)/2. \end{cases}$$
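
The threshold behavior stated above can be verified numerically. The Python sketch below evaluates the per-sample cost at two operating points for several rejection costs and compares the sign of the cost difference with the sign of $(\beta + 1)/2$ minus the rejection cost. The operating points, the list of rejection costs, and the names `cost`, `sign`, and `rho` are hypothetical choices for this example, and only the case $r_1 > r_0$ is covered.

```python
def cost(A, r, rho):
    """Per-sample cost: unit cost per kept misclassification, rho per rejection."""
    return (1 - r) * (1 - A) + rho * r


def sign(x):
    return (x > 0) - (x < 0)


A0, r0 = 0.625, 0.2  # reference operating point
A1, r1 = 0.7, 0.3    # candidate operating point with r1 > r0

# Relative optimality for r1 > r0 and the cost threshold it implies.
beta = 2 * (A1 * (1 - r1) - A0 * (1 - r0)) / (r1 - r0) + 1
threshold = (beta + 1) / 2

for rho in (0.1, 0.5, 0.85, 0.95, 1.0):
    delta = cost(A0, r0, rho) - cost(A1, r1, rho)
    print(rho, sign(delta), sign(threshold - rho))
```

For this pair of points the threshold is close to 0.9, so the candidate point is cheaper for the three smaller rejection costs and more expensive for the two larger ones.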


D. Performance measures 

We can consider the classifier with rejection as two coupled classifiers if we consider the rejector $R$
to be a binary classifier on the output $\hat{\mathbf{y}}$ of the classifier $C$, assigning to each sample a rejected or
nonrejected label. Ideally, $R$ should classify as rejected all samples misclassified by $C$ and classify as
nonrejected all the samples accurately classified by $C$.

In this binary classification formulation, the classification quality $Q$ becomes the accuracy of the binary
classifier $R$, the accuracy of the nonrejected samples $A$ becomes the precision (positive predictive value)
of the binary classifier $R$, and the rejection quality $\phi$ becomes the positive likelihood ratio (the ratio
between the true positive rate and the false positive rate) of the binary classifier $R$. The rejected fraction
becomes the ratio between the number of samples classified as rejected and the total number of samples.

This formulation allows us to show that the triplet $(A(r), Q(r), r)$ completely specifies the behavior
of the rejector by relating the triplet to the confusion matrix associated with the binary classifier $R$. As
we are able to reconstruct the confusion matrix from the triplet, we are thus able to show that the triplet
$(A(r), Q(r), r)$ is sufficient to describe the behavior of the rejector.

Theorem 2. The set of measures $(A(r), Q(r), r)$ completely specifies the behavior of the rejector.

Proof. Let us consider the following confusion matrix associated with $R$:

$$\begin{bmatrix} |\mathcal{A} \cap \mathcal{N}| & |\mathcal{M} \cap \mathcal{N}| \\ |\mathcal{A} \cap \mathcal{R}| & |\mathcal{M} \cap \mathcal{R}| \end{bmatrix},$$

where $n$ denotes the total number of samples, $|\mathcal{A} \cap \mathcal{N}|$ the number of samples accurately classified and
not rejected, $|\mathcal{M} \cap \mathcal{N}|$ the number of samples misclassified but not rejected, $|\mathcal{A} \cap \mathcal{R}|$ the number of samples
accurately classified but rejected, and $|\mathcal{M} \cap \mathcal{R}|$ the number of samples misclassified and rejected. Given
that $n$ binary classifications classified $n$ samples, the confusion matrix associated with $R$ can be uniquely
obtained from the following full rank system:

$$\begin{bmatrix} |\mathcal{A} \cap \mathcal{N}| \\ |\mathcal{M} \cap \mathcal{N}| \\ |\mathcal{A} \cap \mathcal{R}| \\ |\mathcal{M} \cap \mathcal{R}| \end{bmatrix} = n \begin{bmatrix} 0 & 0 & 0 & 1 \\ 1 & -1 & 0 & -1 \\ 0 & 1 & -1 & 1 \\ 0 & 0 & 1 & -1 \end{bmatrix} \begin{bmatrix} 1 \\ r \\ Q(r) \\ A(r)(1 - r) \end{bmatrix}.$$

Therefore, as the set of measures and the confusion matrix are related by a full-rank system, the set of
measures $(A(r), Q(r), r)$ completely specifies the behavior of the rejector. □
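
To make the reconstruction concrete, the Python sketch below recovers the four confusion-matrix counts from the triplet and the number of samples. It orders the counts as accurate and kept, misclassified and kept, accurate and rejected, misclassified and rejected, and applies the corresponding linear map to the vector $(1, r, Q(r), A(r)(1 - r))$. The function name and the sample size of 100 are assumptions for this example.

```python
def confusion_from_triplet(A_r, Q_r, r, n):
    """Return [accurate kept, misclassified kept, accurate rejected, misclassified rejected]."""
    v = [1.0, r, Q_r, A_r * (1 - r)]
    # Each row expresses one count (divided by n) as a combination of the entries of v.
    M = [
        [0, 0, 0, 1],    # accurate kept          = A(r)(1 - r)
        [1, -1, 0, -1],  # misclassified kept     = 1 - r - A(r)(1 - r)
        [0, 1, -1, 1],   # accurate rejected      = r - Q(r) + A(r)(1 - r)
        [0, 0, 1, -1],   # misclassified rejected = Q(r) - A(r)(1 - r)
    ]
    return [n * sum(m * x for m, x in zip(row, v)) for row in M]


# Reference operating point of Fig. 2 with a hypothetical sample size of 100.
counts = confusion_from_triplet(A_r=0.625, Q_r=0.65, r=0.2, n=100)
print([round(c, 6) for c in counts])
```

For these values the counts come out as approximately 50, 30, 5, and 15 samples, which sum to 100 and give a ratio of 15 to 5 misclassified to accurate samples among the rejected ones.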

E. Comparing performance of classifiers with rejection 

Given a classifier $C$ and two rejectors $R_1$ and $R_0$, with $r_1 > r_0$, and a cost function with a rejection-misclassification
trade-off $\rho$, we can now compare the performance of classifiers with rejection.

Rejector $R_1$ outperforms $R_0$ when the following equivalent conditions are satisfied:

$$A_{R_1}(r_1) > A_{R_0}(r_0)\,\frac{1 - r_0}{1 - r_1} + (\rho - 1)\,\frac{r_1 - r_0}{1 - r_1} \iff Q_{R_1}(r_1) > Q_{R_0}(r_0) + (2\rho - 1)(r_1 - r_0).$$

Rejector $R_0$ outperforms $R_1$ when the following equivalent conditions are satisfied:

$$A_{R_1}(r_1) < A_{R_0}(r_0)\,\frac{1 - r_0}{1 - r_1} + (\rho - 1)\,\frac{r_1 - r_0}{1 - r_1} \iff Q_{R_1}(r_1) < Q_{R_0}(r_0) + (2\rho - 1)(r_1 - r_0).$$

Rejectors $R_0$ and $R_1$ are equivalent in terms of performance when the following equivalent conditions
are satisfied:

$$A_{R_1}(r_1) = A_{R_0}(r_0)\,\frac{1 - r_0}{1 - r_1} + (\rho - 1)\,\frac{r_1 - r_0}{1 - r_1} \iff Q_{R_1}(r_1) = Q_{R_0}(r_0) + (2\rho - 1)(r_1 - r_0).$$

The proof is given in the Appendix. 
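
The equivalence between the accuracy-based and the quality-based conditions can be checked on a small example. The Python sketch below evaluates both conditions for several rejection costs. The operating points, the accuracy without rejection `A_base`, and the function names are hypothetical. The classification quality is obtained from the nonrejected accuracy through (9).

```python
def outperforms_via_accuracy(A1, r1, A0, r0, rho):
    """Condition on nonrejected accuracies for the rejector at r1 > r0 to outperform."""
    bound = A0 * (1 - r0) / (1 - r1) + (rho - 1) * (r1 - r0) / (1 - r1)
    return A1 > bound


def outperforms_via_quality(Q1, r1, Q0, r0, rho):
    """Equivalent condition on classification qualities."""
    return Q1 > Q0 + (2 * rho - 1) * (r1 - r0)


def quality(A_r, r, A_base):
    """Classification quality from (9), with A_base the accuracy without rejection."""
    return 2 * A_r * (1 - r) + r - A_base


A_base = 0.55        # hypothetical accuracy with no rejection
A0, r0 = 0.625, 0.2  # rejector R0
A1, r1 = 0.7, 0.3    # rejector R1, rejecting more samples
Q0, Q1 = quality(A0, r0, A_base), quality(A1, r1, A_base)

for rho in (0.1, 0.5, 0.85, 0.95, 1.0):
    print(rho,
          outperforms_via_accuracy(A1, r1, A0, r0, rho),
          outperforms_via_quality(Q1, r1, Q0, r0, rho))
```

Both conditions agree for every rejection cost in the list: the rejector at the larger rejected fraction wins while rejection is cheap and loses once the rejection cost approaches the misclassification cost.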


V. Experimental results 

To illustrate the use of the proposed performance measures, we apply them to the analysis of the 
performance of classifiers with rejection applied to synthetic and real data. We use a simple synthetic 
problem to motivate the problem of classification with rejection and to serve as a toy example. 

We then focus on the application of classification with rejection to pixelwise hyperspectral image 
classification [3], [4], [19], which is prone to the effects of small and nonrepresentative training sets, 
meaning that the classifiers might not be equipped to deal with all existing classes, due to the potential 
presence of unknown classes. Classification with rejection is an interesting avenue for hyperspectral image 
classification as the need to accurately classify the samples is greater than the need to classify all samples. 

A. Synthetic data 

As a toy example, we consider a classification problem consisting of four two-dimensional Gaussians
with the identity matrix as a covariance matrix and centers at (±1, ±1). The Gaussians overlap significantly,
as shown in Fig. 3(a). This results in a simple classification decision: for each sample, assign the label of
the class with the closest center, as in Fig. 3(b).

We illustrate our performance measures by comparing two simple rejection mechanisms: (1) the maximum
probability rejector, which, given a classifier and a rejected fraction, rejects the fraction of samples with
the lowest maximum class probability; and (2) the breaking ties rejector, which, given a classifier and a rejected fraction, rejects
the fraction of samples with the lowest difference between the highest and second-highest class probabilities.
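
A minimal Python sketch of this setup is given below. It draws samples from the four Gaussians, classifies each sample by its largest class posterior, and builds rejection vectors for the two rejectors at a 20% rejected fraction. Equal class priors, the sample size of 1000, the random seed, and all function names are assumptions made for this example; they are not taken from the experiments reported here.

```python
import math
import random

CENTERS = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]


def posteriors(x):
    """Class posteriors for identity-covariance Gaussians with equal priors (assumption)."""
    w = [math.exp(-0.5 * ((x[0] - cx) ** 2 + (x[1] - cy) ** 2)) for cx, cy in CENTERS]
    total = sum(w)
    return [v / total for v in w]


def max_probability_confidence(p):
    return max(p)


def breaking_ties_confidence(p):
    top = sorted(p, reverse=True)
    return top[0] - top[1]


def reject_fraction(confidence, fraction):
    """Reject the given fraction of samples with the lowest confidence."""
    n = len(confidence)
    n_rejected = int(round(fraction * n))
    order = sorted(range(n), key=lambda i: confidence[i])
    r = [0] * n
    for i in order[:n_rejected]:
        r[i] = 1
    return r


rng = random.Random(0)
labels = [rng.randrange(4) for _ in range(1000)]
points = [(rng.gauss(CENTERS[k][0], 1.0), rng.gauss(CENTERS[k][1], 1.0)) for k in labels]
post = [posteriors(x) for x in points]
# With equal priors, the largest posterior corresponds to the closest center.
pred = [max(range(4), key=lambda k: p[k]) for p in post]
a = [int(y_hat == y) for y_hat, y in zip(pred, labels)]

r_mp = reject_fraction([max_probability_confidence(p) for p in post], 0.2)
r_bt = reject_fraction([breaking_ties_confidence(p) for p in post], 0.2)
for name, r in (("maximum probability", r_mp), ("breaking ties", r_bt)):
    kept = [a_i for a_i, r_i in zip(a, r) if r_i == 0]
    print(name, sum(kept) / len(kept))
```

The two rejection vectors reject the same number of samples but generally not the same samples, so the resulting accuracy vectors on the kept set can be fed to any of the measures defined earlier.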

Fig. 3. Synthetic data example. Four Gaussians with equal covariance (identity covariance matrix) and
significant overlap (centered at (±1, ±1)), classified with rejection (in black). (a) Ground truth,
(b) classification with no rejection, (c) classification with 20% rejection using the maximum probability
rejector, and (d) classification with 20% rejection using the breaking ties rejector. The differences between
the two rejectors are clear near the origin. Note that the points are not uniformly distributed.

In Fig. 4, we can see the performance measures computed for all possible rejected fractions for each of
the two rejectors. With the accuracy-rejection curves alone, as shown in Fig. 4(a), we are not able to
single out any operating point of the classifier with rejection. On the other hand, with the classification
quality in Fig. 4(b), we can identify where the rejector is maximizing the number of correct decisions, and
for which cases having a reject option outperforms not having rejection. As illustrated in Fig. 4(c), the
rejection quality provides an easy way to discriminate between two different rejectors, as it focuses on the
analysis of the ratios of correctly classified to incorrectly classified samples on the set of rejected
samples.
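The three measures can be computed from two Boolean vectors, one marking correctly classified samples and one marking rejected samples. The following is a minimal sketch: the nonrejected accuracy and classification quality follow the counts used in the Appendix, while the rejection quality is written here, as an assumption, as the ratio of misclassified to correctly classified samples on the rejected set, normalized by the same ratio on the whole data set. The function name and the ten-sample example are hypothetical.

```python
import math

def performance_measures(correct, rejected):
    """Return (nonrejected accuracy, classification quality, rejection quality)."""
    n = len(correct)
    pairs = list(zip(correct, rejected))
    acc_nonrej = sum(1 for c, r in pairs if c and not r)  # correct and kept
    acc_rej = sum(1 for c, r in pairs if c and r)         # correct but rejected
    mis_rej = sum(1 for c, r in pairs if not c and r)     # misclassified and rejected
    n_nonrej = n - acc_rej - mis_rej
    n_acc = acc_nonrej + acc_rej
    n_mis = n - n_acc

    nonrejected_accuracy = acc_nonrej / n_nonrej if n_nonrej else math.nan
    # Correct decisions: keep a correct sample or reject a misclassified one.
    classification_quality = (acc_nonrej + mis_rej) / n
    if n_mis == 0 or (acc_rej == 0 and mis_rej == 0):
        rejection_quality = math.nan  # ratio undefined
    elif acc_rej == 0:
        rejection_quality = math.inf  # only misclassified samples were rejected
    else:
        rejection_quality = (mis_rej / acc_rej) / (n_mis / n_acc)
    return nonrejected_accuracy, classification_quality, rejection_quality

# Ten hypothetical samples: 7 correct, 3 misclassified; 3 rejected (2 of them misclassified).
correct = [True, True, True, False, False, True, False, True, True, True]
rejected = [False, False, False, True, True, True, False, False, False, False]
A, Q, phi = performance_measures(correct, rejected)
print(f"A = {A:.3f}, Q = {Q:.3f}, phi = {phi:.3f}")
```

In this example the classifier alone has accuracy 0.7, and rejecting three samples raises the fraction of correct decisions to 0.8 because two of the three rejected samples were misclassified.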

The relative optimality plots for both rejectors are presented in Fig. 5. For each possible operating point
of the rejector, for simplicity defined only by the rejected fraction, we compute the relative optimality
of all other operating points of the rejector. We note that, for both rejectors, the operating point that
corresponds to the maximum classification quality has a nonnegative relative optimality with respect to
all other operating points. This relative optimality plot is of particular interest for parameter selection.
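A relative optimality table of this kind can be sketched as follows, using the representation from the Appendix in which the classification quality at a test operating point equals the quality at the reference point plus the relative optimality times the difference in rejected fractions. The operating points below are made-up values for illustration, and only pairs in which the test point rejects more than the reference are tabulated.

```python
def relative_optimality(q_ref, r_ref, q_test, r_test):
    """beta such that q_test = q_ref + beta * (r_test - r_ref), for r_test > r_ref."""
    if r_test <= r_ref:
        raise ValueError("the test point must reject more than the reference")
    return (q_test - q_ref) / (r_test - r_ref)

# Hypothetical operating points: (rejected fraction, classification quality).
operating_points = [(0.0, 0.70), (0.1, 0.78), (0.2, 0.80), (0.3, 0.76)]

table = {}
for r_ref, q_ref in operating_points:
    for r_test, q_test in operating_points:
        if r_test > r_ref:
            table[(r_ref, r_test)] = relative_optimality(q_ref, r_ref, q_test, r_test)

for (r_ref, r_test), beta in sorted(table.items()):
    print(f"reference r = {r_ref:.1f}, test r = {r_test:.1f}: beta = {beta:+.2f}")
```

In this made-up example the classification quality peaks at a rejected fraction of 0.2, and that operating point has positive relative optimality with respect to both reference points that reject less.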
Fig. 4. Performance measures as a function of the rejected fraction for the synthetic example with the
maximum probability rejector (solid blue line) and the breaking ties rejector (dashed red line).

Fig. 5. Relative optimality computed for all possible pairs of operating points of (a) the maximum
probability rejector and (b) the breaking ties rejector.


B. Hyperspectral image data 

In hyperspectral image classification, the use of context, in the form of spatial priors, is widespread
and provides significant performance improvements. This means that, after classification, a computationally
expensive procedure is applied to the classifier output to take contextual effects into account. Accuracy-rejection
curves might not be feasible in this setting, as each change in the rejected fraction often requires rerunning
the computationally expensive context computation. Because context and rejection are used jointly, at a
high computational cost, this is a natural setting for the use of the proposed performance measures.

We use the algorithms for hyperspectral image classification with context and rejection presented in
[19], in their joint (JCR) and sequential (SCR) versions. Both JCR and SCR are based on SegSALSA
(Segmentation by Split Augmented Lagrangian Shrinkage Algorithm). SegSALSA consists of a soft supervised
classifier that assigns a class probability to each pixel of the image, followed by the application of
context through the computation of the marginal maximum a posteriori of a continuous hidden field that
drives the class probabilities, with a smoothness-promoting prior applied to the continuous hidden field.
For more details on SegSALSA, see [20].
Fig. 6. Indian Pines scene. (a) False color composition, (b) ground truth, (c) classification with context
and rejection with SegSALSA-JCR, and (d) classification with context and rejection with SegSALSA-SCR.


We can now introduce rejection as an extra class that models the probability of classifier failure,
resulting in a joint computation of context and rejection: the JCR version. We assume that the probability
of failure is constant for all the pixels of the hyperspectral image, leading to a uniform weighting of the
samples. The higher the probability of failure, the larger the rejected fraction. However, the rejected
fraction obtained cannot be set a priori, and any change in the rejected fraction requires rerunning the
SegSALSA algorithm.

On the other hand, we can harness the hidden fields resulting from the SegSALSA algorithm to obtain
an ordering of the pixels in the classification with context according to their confidence. This results in a
very fast rejection scheme that approximates the joint computation of context and rejection by following a
sequential approach to context and rejection: the SCR version.

Fig. 7. Performance measures as a function of the rejected fraction for the Indian Pines scene with the
SegSALSA-JCR rejector (dashed red line) and the SegSALSA-SCR rejector (solid blue line).

We apply the SegSALSA-JCR and the SegSALSA-SCR to the classification of a well-known benchmark
image in the hyperspectral community, the AVIRIS Indian Pines scene^, as shown in Fig. 6. The scene
consists of a 145 × 145 pixel section with 200 spectral bands (the water absorption bands are removed)
and contains 16 nonmutually exclusive classes.

Following the approach in [19], we learn the class models using a sparse logistic regression with a
training set composed of 10 samples per class and, for the joint approach, perform a parameter sweep on
the probability of classifier failure, obtaining various operating points of the JCR rejector. For the SCR
rejector, as we define the rejected fraction a posteriori, operating points are obtained simply by rejecting
the fraction of pixels with the lowest confidence (smallest value of the posterior probability of the
hidden field). See [19] for a detailed explanation of the JCR and SCR schemes for rejection with context.

As seen in Fig. 7, by looking at the accuracy-rejection curves alone, it is trivial to compare the
performance of the two rejectors when working at the same rejected fraction. However, we cannot draw
any conclusions about the best operating point of each rejector, or about how the rejectors compare to
each other at different rejected fractions. By looking at the classification quality, it is clear where the
maximum number of correct decisions is made for each of the rejectors, and by looking at the rejection
quality we can observe that there is a significant improvement with reject options for lower values of the
rejected fraction.

Fig. 8 shows the relative optimality between each pair of operating points for each of the rejectors.
Using (12), for a given reference operating point and for any test operating point, we can obtain the
threshold value $p_0$ of the trade-off $p$ in the cost function (11) below which the cost function at the
test operating point is smaller than the cost function at the reference operating point. This means that,
for any cost function with $p < p_0$, the test operating point is better than the reference operating point.
We perform this analysis in Fig. 8, where we set the reference operating point to $r_0 = 0$, meaning no
rejection. For each possible value of the rejected fraction $r_1$, we then find the minimum value of $p$
such that no rejection is a better option than rejecting a fraction $r_1$, according to the operating points
defined by the two rejectors.
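The threshold described above can be computed directly from the classification quality at the two operating points. The sketch below relies on the condition stated in the Appendix, that the test point outperforms the reference when its relative optimality exceeds $2p - 1$, which gives $p_0 = (\beta + 1)/2$. The function name and the operating points are hypothetical and are not values read from Fig. 8.

```python
def rejection_cost_threshold(q_ref, r_ref, q_test, r_test):
    """Return p_0 = (beta + 1) / 2, obtained from the condition beta > 2p - 1."""
    beta = (q_test - q_ref) / (r_test - r_ref)  # relative optimality of the test point
    return (beta + 1.0) / 2.0

# Reference: no rejection (r_0 = 0), where the classification quality equals the accuracy.
q_no_rejection = 0.70

# Hypothetical test operating points: (rejected fraction, classification quality).
candidates = [(0.1, 0.78), (0.2, 0.80), (0.3, 0.76)]

thresholds = {r: rejection_cost_threshold(q_no_rejection, 0.0, q, r) for r, q in candidates}
for r, p0 in thresholds.items():
    print(f"rejecting {r:.0%} beats no rejection for p < {p0:.2f}")
```

For any trade-off at or above the printed value, not rejecting is at least as good as rejecting at that operating point.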


Fig. 8. Relative optimality computed for all possible pairs of operating points of (a) JCR and (b) SCR
rejectors, and minimum value of $p$ in the cost function (11), for (c) JCR and (d) SCR rejectors, such that
it is better not to reject ($r_0 = 0$), for each operating point.

^We thank Prof. Landgrebe at Purdue University for providing the AVIRIS Indian Pines scene to the community.


VI. Conclusions

We introduced a set of measures to quantify the performance of classifiers with rejection. We then applied
these performance measures to classifiers with rejection on both synthetic and real-world (hyperspectral
image) data. Furthermore, we connected the performance measures presented with general cost functions
through the concept of relative optimality.

VII. Appendix


A. Properties 

Let us consider a classifier $C$ and two different rejectors $R_1$ and $R_2$.

1) Nonrejected accuracy A: 

a) Property I: The nonrejected accuracy is a function of the number of rejected samples (6). 

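b) Property II: For the same rejected fraction $r$, we have that if the nonrejected accuracy for $R_1$ is
greater than the nonrejected accuracy for $R_2$, then

$$A_{R_1}(r) > A_{R_2}(r) \iff \frac{\|a_{\mathcal{N}_{R_1}}\|}{(1 - r)n} > \frac{\|a_{\mathcal{N}_{R_2}}\|}{(1 - r)n} \iff \|a_{\mathcal{N}_{R_1}}\| > \|a_{\mathcal{N}_{R_2}}\|,$$

meaning $R_1$ outperforms $R_2$. □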

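c) Property III: If $R_1$ outperforms $R_2$, for different rejected fractions $r_1 > r_2$, then
$\|a_{\mathcal{N}_{R_1}}\| = \|a_{\mathcal{N}_{R_2}}\|$, leading to

$$A_{R_1}(r_1) = \frac{\|a_{\mathcal{N}_{R_1}}\|}{(1 - r_1)n} > \frac{\|a_{\mathcal{N}_{R_2}}\|}{(1 - r_2)n} = A_{R_2}(r_2).$$

If $R_1$ outperforms $R_2$, for different rejected fractions $r_1 < r_2$, then
$\|1 - a_{\mathcal{N}_{R_1}}\| = \|1 - a_{\mathcal{N}_{R_2}}\|$, leading to

$$1 - A_{R_1}(r_1) = \frac{\|1 - a_{\mathcal{N}_{R_1}}\|}{(1 - r_1)n} < \frac{\|1 - a_{\mathcal{N}_{R_2}}\|}{(1 - r_2)n} = 1 - A_{R_2}(r_2).$$
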
d) Property IV: The nonrejected accuracy achieves its maximum, 1, when $\mathcal{N} = \mathcal{A}$ and
$\mathcal{R} = \mathcal{M}$. This maximum is not unique, however: any selection of $\mathcal{N}$ such that
$\mathcal{N} \subseteq \mathcal{A}$ achieves the maximum value of nonrejected accuracy. The minimum of the
nonrejected accuracy, 0, is achieved when $\mathcal{N} = \mathcal{M}$ and $\mathcal{R} = \mathcal{A}$. Any
selection of $\mathcal{N}$ such that $\mathcal{N} \subseteq \mathcal{M}$ achieves the minimum value of
nonrejected accuracy.

2) Classification quality Q: 

a) Property I: As seen in (9), the classification quality is a function of the number of rejected 
samples. 

b) Property II: With the representation of the classification quality in (9), we can note that, for the
same rejected fraction $r$, if the classification quality for $R_1$ is higher than the classification quality
for $R_2$, then

$$Q_{R_1}(r) > Q_{R_2}(r) \iff 2A_{R_1}(r)(1 - r) - A(0) > 2A_{R_2}(r)(1 - r) - A(0) \iff A_{R_1}(r) > A_{R_2}(r) \iff \|a_{\mathcal{N}_{R_1}}\| > \|a_{\mathcal{N}_{R_2}}\|.$$

□


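c) Property III: If $R_1$ outperforms $R_2$, for different rejected fractions $r_1 > r_2$, then
$\|a_{\mathcal{N}_{R_1}}\| = \|a_{\mathcal{N}_{R_2}}\|$, and $Q_{R_1}(r_1) > Q_{R_2}(r_2)$.

If $R_1$ outperforms $R_2$, for different rejected fractions $r_1 < r_2$, then
$\|1 - a_{\mathcal{N}_{R_1}}\| = \|1 - a_{\mathcal{N}_{R_2}}\|$, and

$$n Q_{R_1}(r_1) = \|a_{\mathcal{N}_{R_1}}\| + \|1 - a_{\mathcal{R}_{R_1}}\| = \|a_{\mathcal{N}_{R_1}}\| + |\mathcal{R}_{R_1}| - \|a_{\mathcal{R}_{R_1}}\| = \dots$$
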
d) Property IV: The classification quality achieves its unique maximum, 1, if $\mathcal{A} = \mathcal{N}$
and $\mathcal{M} = \mathcal{R}$. Conversely, it achieves its unique minimum, 0, if $\mathcal{A} = \mathcal{R}$
and $\mathcal{M} = \mathcal{N}$.

3) Rejection quality $\phi$:

a) Property I: Let $B(r)$ denote the rejected accuracy $\|a_{\mathcal{R}}\| / |\mathcal{R}|$; we have

b) Property II: For the same rejected fraction $r$, we have that if the rejection quality for $R_1$ is greater
than the rejection quality for $R_2$, then

c) Property III: If $R_1$ outperforms $R_2$, for different rejected fractions $r_1 > r_2$, then
$\|a_{\mathcal{N}_{R_1}}\| = \|a_{\mathcal{N}_{R_2}}\|$. As $\|a\| = \|a_{\mathcal{N}}\| + \|a_{\mathcal{R}}\|$
and $r_1 > r_2$, we have $\|a_{\mathcal{R}_{R_1}}\| = \|a_{\mathcal{R}_{R_2}}\|$ and
$|\mathcal{R}_{R_1}| > |\mathcal{R}_{R_2}|$

d) Property IV: The rejection quality achieves its maximum, $\infty$, when $\mathcal{N} = \mathcal{A}$ and
$\mathcal{R} = \mathcal{M}$. This maximum is not unique: any selection of $\mathcal{R}$ such that
$\mathcal{R} \subseteq \mathcal{M}$ results in the maximum value of rejection quality. Conversely, the
rejection quality achieves its minimum, 0, when $\mathcal{R} = \mathcal{A}$ and $\mathcal{N} = \mathcal{M}$.
This minimum is not unique: any selection of $\mathcal{R}$ such that $\mathcal{R} \subseteq \mathcal{A}$
results in the minimum value of rejection quality.


B. Comparing performance of classifiers with rejection 

Let us consider a classifier $C$ and two rejectors $R_1$ and $R_0$, with $r_1 > r_0$, and a cost function
with a rejection-misclassification trade-off $p$. Let $\beta$ be the relative optimality of the operating
point of rejector $R_1$ at $r_1$ with respect to the reference operating point of $R_0$ at $r_0$.

From (16), and given the cost function with a rejection-misclassification trade-off $p$, we can relate
outperformance, $\beta$, and $p$.

Rejector $R_1$ outperforms $R_0$ when

$$\beta > 2p - 1.$$

Rejector $R_0$ outperforms $R_1$ when

$$\beta < 2p - 1.$$

Rejectors $R_0$ and $R_1$ are equivalent in terms of performance when

$$\beta = 2p - 1.$$
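These conditions can be checked numerically on a small example. The cost function (11) is not reproduced in this section, so the sketch below assumes a simple form that is consistent with the conditions above: a cost of 1 for each nonrejected misclassified sample, a cost of $p$ for each rejected sample, and no cost for a correctly classified nonrejected sample. The counts and the name `total_cost` are hypothetical.

```python
def total_cost(n_misclassified_nonrejected, n_rejected, p):
    """Assumed cost: 1 per nonrejected misclassification, p per rejection, 0 otherwise."""
    return n_misclassified_nonrejected + p * n_rejected

n = 100
# Reference R_0 at r_0 = 0: 70 correct and 30 misclassified samples, nothing rejected.
r0, q0, mis_nonrej0, rej0 = 0.0, 70 / n, 30, 0
# Test R_1 at r_1 = 0.2: rejects 15 misclassified and 5 correctly classified samples,
# leaving 65 correct and 15 misclassified nonrejected samples.
r1, q1, mis_nonrej1, rej1 = 0.2, (65 + 15) / n, 15, 20

beta = (q1 - q0) / (r1 - r0)  # relative optimality of R_1 with respect to R_0

agreement = []
for p in (0.1, 0.5, 0.7, 0.8, 1.0):
    cheaper = total_cost(mis_nonrej1, rej1, p) < total_cost(mis_nonrej0, rej0, p)
    agreement.append(cheaper == (beta > 2 * p - 1))
    print(f"p = {p:.1f}: R_1 has lower cost: {cheaper}, beta > 2p - 1: {beta > 2 * p - 1}")
```

Under the assumed cost, the comparison of total costs and the comparison of $\beta$ with $2p - 1$ give the same answer for every tested trade-off.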

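1) Nonrejected accuracy A: We can represent $A_{R_1}(r_1)$ as a function of $A_{R_0}(r_0)$ by noting that
the best case scenario is

$$A_{R_1}(r_1) = A_{R_0}(r_0)\,\frac{1 - r_0}{1 - r_1},$$

corresponding to $\beta = 1$, and the worst case scenario is

$$A_{R_1}(r_1) = A_{R_0}(r_0)\,\frac{1 - r_0}{1 - r_1} - \frac{r_1 - r_0}{1 - r_1},$$

corresponding to $\beta = -1$.
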
2) Classification quality Q: We can represent $Q_{R_1}(r_1)$ as a function of $Q_{R_0}(r_0)$ by noting that
the best case scenario is

$$Q_{R_1}(r_1) = Q_{R_0}(r_0) + (r_1 - r_0),$$

corresponding to $\beta = 1$, and the worst case scenario is

$$Q_{R_1}(r_1) = Q_{R_0}(r_0) - (r_1 - r_0),$$

corresponding to $\beta = -1$. This results in a representation of the classification quality
$Q_{R_1}(r_1)$ as

$$Q_{R_1}(r_1) = Q_{R_0}(r_0) + \beta(r_1 - r_0).$$

Rejector $R_1$ outperforms $R_0$ when

$$Q_{R_1}(r_1) = Q_{R_0}(r_0) + \beta(r_1 - r_0) > Q_{R_0}(r_0) + (2p - 1)(r_1 - r_0).$$

Rejector $R_0$ outperforms $R_1$ when

$$Q_{R_1}(r_1) = Q_{R_0}(r_0) + \beta(r_1 - r_0) < Q_{R_0}(r_0) + (2p - 1)(r_1 - r_0).$$

Rejectors $R_0$ and $R_1$ are equivalent in terms of performance when

$$Q_{R_1}(r_1) = Q_{R_0}(r_0) + \beta(r_1 - r_0) = Q_{R_0}(r_0) + (2p - 1)(r_1 - r_0).$$